Introduction

Coronavirus has obviously had a tremendous impact on our day to day lives; as has been the case with many countries, leaders have tried to balance the lives of people against the economic status and growth of the nation. We felt that one good way to measure the economic impact of coronavirus is through stocks - specifically the S&P 500. The reason we chose the S&P 500 is because we feel it has a good range of stocks. Particularly, we feel that it is a good measure because a significant portion of the S&P 500 is Big Tech, which has proven resilient to the coronavirus and thus reflects not only the general impact of covid but also of these tech companies. By reading the different factors and the correlated stock prices, we hope to be able to predict and find the extent to which these factors impact the stock market. The purpose of including other economic indicators was to gain a better personal understanding of how these indicators affect our market in general.

What is this data?

This data derives from analyzing various datasets regarding the US stock market, COVID-19 data (both US and China), and national unemployment and political attributes. We accumulated these to be evenly distributed across a period of 8 years, disregarding national holidays (which stock markets are closed on).

Where did it come from/ Why is it interesting to us?

What are the variables?

What is the goal of this model?

Ultimately, we want to find which of these predictor variables are most useful for predicting SPY’s closing price and what interactions between these predictors will yield us the most impactful model.

Methods

Loading in the Data

First, we load in our data and convert our categorical variables from character types to format types

# reading in our data file
spy500_predictors = read.csv("spy500_predictors.csv")
# These are our columns. 
# We are using the predictors: SPY-Volume, US New Daily Cases, US New Daily Deaths, US New Daily Test, China New Daily Cases, China New Daily Deaths, US Unemployment Rate, US Quarterly GDP, sitting President of US, US Senate majority party, VIX (Volatility Index) Open, VIX Close
colnames(spy500_predictors)
##  [1] "Date"                     "SPY...Close"             
##  [3] "SPY...Volume"             "US.New.Daily.Cases"      
##  [5] "US.New.Daily.Deaths"      "US.New.Daily.Tests"      
##  [7] "China.New.Daily.Cases"    "China.New.Daily.Deaths"  
##  [9] "US.Unemployment.Rate...." "US.Quarterly.GDP"        
## [11] "Sitting.President"        "Senate.Majority.Party"   
## [13] "VIX...Open"               "VIX...Close"             
## [15] "Season"
spy500_predictors$Sitting.President <- factor(spy500_predictors$Sitting.President)
spy500_predictors$Senate.Majority.Party <- factor(spy500_predictors$Senate.Majority.Party)
spy500_predictors$Season <- factor(spy500_predictors$Season)

Figuring Out Which Predictors to Use

Next, we create plots to observe the general correlation of each of our predictors to the closing price of SPY.

# Graphical Representation of the response variable vs. each of our predictor variables
par(mfrow=c(2,2))
plot(SPY...Close~1, pch = 20, data = spy500_predictors, xlab = "Days past 8/27/12", ylab = "SPY Close", main = "SPY Close Null Model") # this is our null model
plot(SPY...Close~US.Quarterly.GDP,pch = 20, data = spy500_predictors, xlab = "US Quarterly GDP (in $)", ylab = "SPY Close", main = "US Quarterly GDP vs SPY Close")
plot(SPY...Close~US.New.Daily.Cases,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Cases", ylab = "SPY Close", main = "US New Daily Covid Cases vs SPY Close", ylim = c(200, 400))
plot(SPY...Close~SPY...Volume,pch = 20, data = spy500_predictors,xlab = "SPY Volume", ylab = "SPY Close", main = "SPY Volume vs SPY Close")

plot(SPY...Close~US.New.Daily.Tests,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Tests", ylab = "SPY Close", main = "US New Daily Covid Tests vs SPY Close")
plot(SPY...Close~US.New.Daily.Deaths,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Deaths", ylab = "SPY Close", main = "US New Daily Covid Deaths vs SPY Close")
plot(SPY...Close~China.New.Daily.Cases,pch = 20, data = spy500_predictors, xlab = "China New Daily Covid Cases", ylab = "SPY Close", main = "China New Daily Covid Cases vs SPY Close")
plot(SPY...Close~China.New.Daily.Deaths,pch = 20, data = spy500_predictors, xlab = "US New Daily Covid Cases", ylab = "SPY Close", main = "US New Daily Covid Cases vs SPY Close")

plot(SPY...Close~US.Unemployment.Rate....,pch = 20, data = spy500_predictors, xlab = "US Unemployment Rate", ylab = "SPY Close", main = "US Unemployment Rate vs SPY Close")
plot(SPY...Close~VIX...Open,pch = 20, data = spy500_predictors, xlab = "VIX Open", ylab = "SPY Close", main = "Vix Open vs SPY Close")
plot(SPY...Close~VIX...Close,pch = 20, data = spy500_predictors, xlab = "VIX Close", ylab = "SPY Close", main = "Vix Close vs SPY Close")
plot(SPY...Close~Senate.Majority.Party, data = spy500_predictors, xlab = "Senate Majority Party (factored)", ylab = "SPY Close", main = "Senate Majority Party vs SPY Close")

plot(SPY...Close~Sitting.President, data = spy500_predictors, xlab = "Sitting President (factored)", ylab = "SPY Close", main = "Sitting President vs SPY Close")

Based on what we see in the graphs, we can observe that the GPD seems to be positively correlated with the SPY 500’s closing price and that the unemployment rate is generally negatively correlated (as expected). Ironically, it seems as though the the number of new daily COVID cases has had a positive impact on closing price as there was widespread panic when the US was at relatively low COVID cases and the economy now seems to have made a full recovery despite COVID cases being higher than ever. Thankfully however, we can observe that deaths still negatively impacted the price of SPY, and the disconnect between deaths and cases can be attributed to the lowering death rate as cases increased. We also see that the number of tests conducted per day appears to have a relatively strong positive correlation to the stock price.

Indicators like volume and the VIX are actually found to be less correlated than we had expected but must be further tested to truly observe their significance. Also, the data from China seems to have relatively little correlation.

One flaw we discovered while working on this project, which highlighted the ability of statistics to sometimes obscure the truth, was that based on the data frame we chose, it would appear that Donald Trump and the Republican Senate majority have overall been correlated with a higher SPY price over the last 8 years. However, what this fails to account for is that fact that the price of the SPY 500 goes up overall over time, so the last 8 years would be a very biased time frame with the Republican Senate and Donald Trump being associated with the $300+ SPY close priced of 2019 and beyond.

Single Predictor Linear Regression Models - For further evaluation of each predictors effectiveness when used alone

rmse  = function(m) {
  sqrt(mean(resid(m) ^ 2))
}

volume_model = lm(SPY...Close ~ SPY...Volume, data = spy500_predictors)
print("volume_model: Adj R-squared, R-Squared, RMSE")
summary(volume_model)$adj.r.squared
summary(volume_model)$r.squared
rmse(volume_model)

US_daily_cases_model = lm(SPY...Close ~ US.New.Daily.Cases, data = spy500_predictors)
print("US_daily_cases_model: ")
summary(US_daily_cases_model)$adj.r.squared
summary(US_daily_cases_model)$r.squared
rmse(US_daily_cases_model)

China_daily_cases_model = lm(SPY...Close ~ China.New.Daily.Cases, data = spy500_predictors)
print("China_daily_cases_model:")
summary(China_daily_cases_model)$adj.r.squared
summary(China_daily_cases_model)$r.squared
rmse(China_daily_cases_model)

US_unemployment_rate_model = lm(SPY...Close ~ US.Unemployment.Rate...., data = spy500_predictors)
print("US_unemployment_rate_model: Adj R-squared, R-Squared, RMSE")
summary(US_unemployment_rate_model)$adj.r.squared
summary(US_unemployment_rate_model)$r.squared
rmse(US_unemployment_rate_model)

US_GDP_model = lm(SPY...Close ~ US.Quarterly.GDP, data = spy500_predictors)
print("US_GDP_model: Adj R-squared, R-Squared, RMSE")
summary(US_GDP_model)$adj.r.squared
summary(US_GDP_model)$r.squared
rmse(US_GDP_model)

US_president_model = lm(SPY...Close ~ Sitting.President, data = spy500_predictors)
print("US_president_model: Adj R-squared, R-Squared, RMSE")
summary(US_president_model)$adj.r.squared
summary(US_president_model)$r.squared
rmse(US_president_model)

US_senate_majority_model = lm(SPY...Close ~ Senate.Majority.Party, data = spy500_predictors)
print("US_senate_majority_model: Adj R-squared, R-Squared, RMSE")
summary(US_senate_majority_model)$adj.r.squared
summary(US_senate_majority_model)$r.squared
rmse(US_senate_majority_model)

VIX_open_model = lm(SPY...Close ~ VIX...Open, data = spy500_predictors)
print("VIX_open_model: Adj R-squared, R-Squared, RMSE")
summary(VIX_open_model)$adj.r.squared
summary(VIX_open_model)$r.squared
rmse(VIX_open_model)


VIX_close_model = lm(SPY...Close ~ VIX...Close, data = spy500_predictors)
print("VIX_close_model: Adj R-squared, R-Squared, RMSE")
summary(VIX_close_model)$adj.r.squared
summary(VIX_close_model)$r.squared
rmse(VIX_close_model)

season_model = lm(SPY...Close ~ Season, data = spy500_predictors)
print("season_model: Adj R-squared, R-Squared, RMSE")
summary(season_model)$adj.r.squared
summary(season_model)$r.squared
rmse(season_model)
## [1] "volume_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.1250146
## [1] 0.1254382
## [1] 49.23804
## [1] "US_daily_cases_model: "
## [1] 0.1991564
## [1] 0.199544
## [1] 47.10578
## [1] "China_daily_cases_model:"
## [1] 0.02147443
## [1] 0.02194806
## [1] 52.06987
## [1] "US_unemployment_rate_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.06777032
## [1] 0.06822154
## [1] 50.82318
## [1] "US_GDP_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.8951023
## [1] 0.8951531
## [1] 17.04839
## [1] "US_president_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.7421466
## [1] 0.7422714
## [1] 26.72925
## [1] "US_senate_majority_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.4924407
## [1] 0.4926864
## [1] 37.50108
## [1] "VIX_open_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.05579382
## [1] 0.05625084
## [1] 51.14861
## [1] "VIX_close_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.0530402
## [1] 0.05349855
## [1] 51.22314
## [1] "season_model: Adj R-squared, R-Squared, RMSE"
## [1] 0.01239208
## [1] 0.01382617
## [1] 52.28562

This R^2, Adjusted R^2, and RMSE of each single predictor linear model are shown and we used these values to gain some intuition as to which predictors we wanted to model.

Naive Additive Model

Now that we have some basic understanding of the predictor variables being used and their general correlation visualized along with some statistics to quantify the correlation, we will begin our statistical modeling with a naive additive model which uses all of the predictor variables.

# The null model against which to compare our additive model in Anova Testing
null_model = lm(SPY...Close ~ 1, data = spy500_predictors)
# Creating our naive additive model which uses all of our predictors
naive_additive_model_full = lm(SPY...Close ~ SPY...Volume + US.New.Daily.Cases + US.New.Daily.Deaths + US.New.Daily.Tests + China.New.Daily.Cases + China.New.Daily.Deaths + US.Unemployment.Rate.... + US.Quarterly.GDP + Sitting.President + Senate.Majority.Party + VIX...Open + VIX...Close + Season, data = spy500_predictors)

# All assumptions of regression model seem to have been met based on these assumption checks
par(mfrow=c(2,2))
plot(naive_additive_model_full)

par(mfrow=c(1,1))

# Checking the results of our first naive additive linear regression model
predictions = predict(naive_additive_model_full)
plot(spy500_predictors$SPY...Close ~ 1)
lines(predictions ~ 1, col = "green")

summary(naive_additive_model_full)
## 
## Call:
## lm(formula = SPY...Close ~ SPY...Volume + US.New.Daily.Cases + 
##     US.New.Daily.Deaths + US.New.Daily.Tests + China.New.Daily.Cases + 
##     China.New.Daily.Deaths + US.Unemployment.Rate.... + US.Quarterly.GDP + 
##     Sitting.President + Senate.Majority.Party + VIX...Open + 
##     VIX...Close + Season, data = spy500_predictors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.016  -5.968  -0.397   6.027  55.911 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -3.737e+02  8.991e+00 -41.562  < 2e-16 ***
## SPY...Volume                     7.732e-09  6.379e-09   1.212 0.225615    
## US.New.Daily.Cases              -1.366e-04  4.706e-05  -2.904 0.003728 ** 
## US.New.Daily.Deaths              5.058e-03  1.395e-03   3.625 0.000296 ***
## US.New.Daily.Tests               4.619e-05  2.667e-06  17.320  < 2e-16 ***
## China.New.Daily.Cases            4.815e-03  5.506e-04   8.745  < 2e-16 ***
## China.New.Daily.Deaths           2.055e-02  7.454e-03   2.756 0.005896 ** 
## US.Unemployment.Rate....         4.433e+00  2.933e-01  15.114  < 2e-16 ***
## US.Quarterly.GDP                 3.118e-02  4.536e-04  68.745  < 2e-16 ***
## Sitting.PresidentDonald Trump    3.688e+00  9.546e-01   3.863 0.000115 ***
## Senate.Majority.PartyRepublican -1.434e+00  8.068e-01  -1.778 0.075632 .  
## VIX...Open                      -4.945e-01  1.357e-01  -3.643 0.000276 ***
## VIX...Close                     -6.364e-01  1.424e-01  -4.470 8.26e-06 ***
## SeasonSpring                     1.500e+00  6.135e-01   2.444 0.014592 *  
## SeasonSummer                     4.811e-01  5.932e-01   0.811 0.417478    
## SeasonWinter                     2.240e+00  6.262e-01   3.577 0.000356 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.612 on 2051 degrees of freedom
## Multiple R-squared:  0.9669, Adjusted R-squared:  0.9667 
## F-statistic:  3998 on 15 and 2051 DF,  p-value: < 2.2e-16
anova(null_model, naive_additive_model_full)
## Analysis of Variance Table
## 
## Model 1: SPY...Close ~ 1
## Model 2: SPY...Close ~ SPY...Volume + US.New.Daily.Cases + US.New.Daily.Deaths + 
##     US.New.Daily.Tests + China.New.Daily.Cases + China.New.Daily.Deaths + 
##     US.Unemployment.Rate.... + US.Quarterly.GDP + Sitting.President + 
##     Senate.Majority.Party + VIX...Open + VIX...Close + Season
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   2066 5729959                                  
## 2   2051  189480 15   5540479 3998.1 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on what we observe in the summary of the additive model, the anova tests, and the plot, we have a decent model with a relatively good R^2 value which we can continue to build open. We further confirmed that all of the predictors we were using were significant by individually removing them from the overall model and performing ANOVA tests to conclude that the model would be significantly different with a lower adjusted R-squared value if we removed any one of the predictors. This meant that our full additive model was the best one we had up to this point.

Polynomial Additive Modeling

Here we intentionally created an over-fitted model so that we could figure out what polynomial of each predictor variable would be best for our model and then went ahead and chose the appropriate polynomials.

overfitted_model = lm(SPY...Close ~ poly(SPY...Volume, 10) + poly(US.New.Daily.Cases, 10) + poly(US.New.Daily.Deaths, 10) + poly(US.New.Daily.Tests, 10) + poly(China.New.Daily.Cases, 1) + poly(China.New.Daily.Deaths, 1) + poly(US.Unemployment.Rate...., 10) + poly(US.Quarterly.GDP, 10) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 10) + Season, data = spy500_predictors)
summary(overfitted_model)$adj.r.squared
## [1] 0.9894362
rmse(overfitted_model)
## [1] 5.309677
par(mfrow=c(2,2))
plot(overfitted_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

predictions = predict(overfitted_model)
plot(spy500_predictors$SPY...Close ~ 1)
lines(predictions ~ 1, col = "green")

As we can see, our over-fitted model violates some assumptions and is clearly over-fitted in the prediction graph.

Polynomial and Interactive Modeling

Now that we had figured out which polynomials of predictors to use we were ready to test various interactions until we found the most ideal model for our purpose.

# all assumptions met
adjusted_polynomial_add_model = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 5) + Season, data = spy500_predictors)
summary(adjusted_polynomial_add_model)$adj.r.squared
## [1] 0.9852856
rmse(adjusted_polynomial_add_model)
## [1] 6.33082
par(mfrow=c(2,2))
plot(adjusted_polynomial_add_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

# all assumptions met - except a few influential points
adjusted_polynomial_interactive_model = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + poly(VIX...Open, 5) + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model)$adj.r.squared
## [1] 0.9854494
rmse(adjusted_polynomial_interactive_model)
## [1] 6.272174
par(mfrow=c(2,2))
plot(adjusted_polynomial_interactive_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

# Very tighly fit model violating assumptions
adjusted_polynomial_interactive_model_2 = lm(SPY...Close ~ poly(SPY...Volume, 2)*poly(VIX...Open, 5) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3)*poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_2)$adj.r.squared
## [1] 0.9875328
rmse(adjusted_polynomial_interactive_model_2)
## [1] 5.723089
par(mfrow=c(2,2))
plot(adjusted_polynomial_interactive_model_2)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

# another model violating assumptions
adjusted_polynomial_interactive_model_3 = lm(SPY...Close ~ poly(SPY...Volume, 2)*poly(VIX...Open, 5) + poly(US.New.Daily.Cases, 5)*poly(US.New.Daily.Deaths, 3)*poly(US.New.Daily.Tests, 2) + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6)*poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_3)$adj.r.squared
## [1] 0.9942734
rmse(adjusted_polynomial_interactive_model_3)
## [1] 3.84295
par(mfrow=c(2,2))
plot(adjusted_polynomial_interactive_model_3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

# Here is our final model with only has less than 5 influential points out of over 2000
adjusted_polynomial_interactive_model_4 = lm(SPY...Close ~ poly(SPY...Volume, 2) + poly(VIX...Open, 5) + I(SPY...Volume ^ 2)*I(VIX...Open^5) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 2) + I(US.New.Daily.Cases^5)*I(US.New.Daily.Deaths^3)*I(US.New.Daily.Tests^2)  + China.New.Daily.Cases*China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + poly(US.Quarterly.GDP, 6) + I(US.Unemployment.Rate.... ^ 6)*I(US.Quarterly.GDP^6) + Sitting.President + Senate.Majority.Party + Season, data = spy500_predictors)
summary(adjusted_polynomial_interactive_model_4)$adj.r.squared
## [1] 0.9865593
rmse(adjusted_polynomial_interactive_model_4)
## [1] 6.038683
par(mfrow=c(2,2))
plot(adjusted_polynomial_interactive_model_4)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

par(mfrow=c(1,1))

anova(adjusted_polynomial_add_model, adjusted_polynomial_interactive_model)
## Analysis of Variance Table
## 
## Model 1: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 
##     5) + poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 
##     2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 
##     6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + 
##     poly(VIX...Open, 5) + Season
## Model 2: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 
##     5) * poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 
##     2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 
##     6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + 
##     poly(VIX...Open, 5) + Season
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1   2030 82844                                
## 2   2015 81316 15    1527.8 2.5238 0.001024 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(adjusted_polynomial_interactive_model, adjusted_polynomial_interactive_model_4)
## Analysis of Variance Table
## 
## Model 1: SPY...Close ~ poly(SPY...Volume, 2) + poly(US.New.Daily.Cases, 
##     5) * poly(US.New.Daily.Deaths, 3) + poly(US.New.Daily.Tests, 
##     2) + China.New.Daily.Cases + China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 
##     6) + poly(US.Quarterly.GDP, 6) + Sitting.President + Senate.Majority.Party + 
##     poly(VIX...Open, 5) + Season
## Model 2: SPY...Close ~ poly(SPY...Volume, 2) + poly(VIX...Open, 5) + I(SPY...Volume^2) * 
##     I(VIX...Open^5) + poly(US.New.Daily.Cases, 5) + poly(US.New.Daily.Deaths, 
##     3) + poly(US.New.Daily.Tests, 2) + I(US.New.Daily.Cases^5) * 
##     I(US.New.Daily.Deaths^3) * I(US.New.Daily.Tests^2) + China.New.Daily.Cases * 
##     China.New.Daily.Deaths + poly(US.Unemployment.Rate...., 6) + 
##     poly(US.Quarterly.GDP, 6) + I(US.Unemployment.Rate....^6) * 
##     I(US.Quarterly.GDP^6) + Sitting.President + Senate.Majority.Party + 
##     Season
##   Res.Df   RSS Df Sum of Sq F Pr(>F)
## 1   2015 81316                      
## 2   2022 75375 -7    5941.5

Based on our results, the best model we have generated so far is the adjusted_polynomial_interactive_model_2. We will not check how effective this model is at predicting the same time series.

predictions = predict(adjusted_polynomial_interactive_model_4)
plot(spy500_predictors$SPY...Close ~ 1, xlab = "Days past 8/27/12", ylab = "SPY Close",)
lines(predictions ~ 1, col = "green")

The final model which we select appears to do a rather good job of modeling the actual price of the SPY 500. As we can see, there have recently been times when the SPY 500 overreacts to corona virus scares, but overall, the economic indicators and COVID measures chosen seem to be fairly correlated with the stock’s price.

Results

We have chosen adjusted_polynomial_interactive_model_4 as our final model. However, we will see that there is an issue.

par(mfrow = c(1,2))
plot(adjusted_polynomial_interactive_model_4, which = 1, col = "dodgerblue", pch = 20)
plot(adjusted_polynomial_interactive_model_4, which = 2, col = "dodgerblue", pch = 20)

adjusted_polynomial_interactive_model_4 does not violate the constant variance, linearity, or normality. While our Q-Q plot tapers off slightly, it is due to the enormous amount of data points we consider.

library(faraway)
vif(adjusted_polynomial_interactive_model_4)
##                                                   poly(SPY...Volume, 2)1 
##                                                             2.452963e+00 
##                                                   poly(SPY...Volume, 2)2 
##                                                             1.486178e+00 
##                                                     poly(VIX...Open, 5)1 
##                                                             1.593967e+01 
##                                                     poly(VIX...Open, 5)2 
##                                                             1.708614e+01 
##                                                     poly(VIX...Open, 5)3 
##                                                             4.296714e+00 
##                                                     poly(VIX...Open, 5)4 
##                                                             1.854856e+00 
##                                                     poly(VIX...Open, 5)5 
##                                                             2.019789e+00 
##                                                        I(SPY...Volume^2) 
##                                                             7.308237e+37 
##                                                          I(VIX...Open^5) 
##                                                             4.778595e+21 
##                                             poly(US.New.Daily.Cases, 5)1 
##                                                             4.210624e+01 
##                                             poly(US.New.Daily.Cases, 5)2 
##                                                             1.046380e+01 
##                                             poly(US.New.Daily.Cases, 5)3 
##                                                             5.490332e+00 
##                                             poly(US.New.Daily.Cases, 5)4 
##                                                             3.143398e+01 
##                                             poly(US.New.Daily.Cases, 5)5 
##                                                             5.054705e+00 
##                                            poly(US.New.Daily.Deaths, 3)1 
##                                                             3.325179e+00 
##                                            poly(US.New.Daily.Deaths, 3)2 
##                                                             1.289556e+02 
##                                            poly(US.New.Daily.Deaths, 3)3 
##                                                             1.390292e+01 
##                                             poly(US.New.Daily.Tests, 2)1 
##                                                             1.719547e-08 
##                                             poly(US.New.Daily.Tests, 2)2 
##                                                             6.634159e-07 
##                                                  I(US.New.Daily.Cases^5) 
##                                                             7.017566e+54 
##                                                 I(US.New.Daily.Deaths^3) 
##                                                             1.130145e+25 
##                                                  I(US.New.Daily.Tests^2) 
##                                                             9.224942e+27 
##                                                    China.New.Daily.Cases 
##                                                             4.825236e+09 
##                                                   China.New.Daily.Deaths 
##                                                             1.148448e+07 
##                                       poly(US.Unemployment.Rate...., 6)1 
##                                                             5.008945e+00 
##                                       poly(US.Unemployment.Rate...., 6)2 
##                                                             2.878927e+27 
##                                       poly(US.Unemployment.Rate...., 6)3 
##                                                             9.654434e+25 
##                                       poly(US.Unemployment.Rate...., 6)4 
##                                                             8.127351e+23 
##                                       poly(US.Unemployment.Rate...., 6)5 
##                                                             2.614995e+21 
##                                       poly(US.Unemployment.Rate...., 6)6 
##                                                             2.377530e+18 
##                                               poly(US.Quarterly.GDP, 6)1 
##                                                             3.155056e+14 
##                                               poly(US.Quarterly.GDP, 6)2 
##                                                             1.984918e-27 
##                                               poly(US.Quarterly.GDP, 6)3 
##                                                             2.810640e-02 
##                                               poly(US.Quarterly.GDP, 6)4 
##                                                             2.609928e-02 
##                                               poly(US.Quarterly.GDP, 6)5 
##                                                             4.996574e-03 
##                                               poly(US.Quarterly.GDP, 6)6 
##                                                             4.660478e-03 
##                                            I(US.Unemployment.Rate....^6) 
##                                                             1.389896e+13 
##                                                    I(US.Quarterly.GDP^6) 
##                                                             1.164150e+02 
##                                            Sitting.PresidentDonald Trump 
##                                                             7.103104e-66 
##                                          Senate.Majority.PartyRepublican 
##                                                             1.179825e-72 
##                                                             SeasonSpring 
##                                                             6.361464e-42 
##                                                             SeasonSummer 
##                                                             1.415238e-10 
##                                                             SeasonWinter 
##                                                             5.445415e-62 
##                                        I(SPY...Volume^2):I(VIX...Open^5) 
##                                                             1.352156e-39 
##                         I(US.New.Daily.Cases^5):I(US.New.Daily.Deaths^3) 
##                                                             1.089915e+71 
##                          I(US.New.Daily.Cases^5):I(US.New.Daily.Tests^2) 
##                                                             5.100804e+76 
##                         I(US.New.Daily.Deaths^3):I(US.New.Daily.Tests^2) 
##                                                             5.550326e+45 
##                             China.New.Daily.Cases:China.New.Daily.Deaths 
##                                                             2.657590e+14 
##                      I(US.Unemployment.Rate....^6):I(US.Quarterly.GDP^6) 
##                                                             3.691105e+67 
## I(US.New.Daily.Cases^5):I(US.New.Daily.Deaths^3):I(US.New.Daily.Tests^2) 
##                                                             3.199532e+95

The issue lies in some predictor variables having extremely high VIF values and therefore adjusted_polynomial_interactive_model_4 suffers a bit from multicollinearity. We will use Akaike Information Criterion to improve upon our model to arrive at final_mod.

final_mod = step(adjusted_polynomial_interactive_model_4, trace = 0, direction = "backward")
par(mfrow = c(1, 2))
plot(final_mod, which = 1, col = "dodgerblue", pch = 20)
plot(final_mod, which = 2, col = "dodgerblue", pch = 20)

final_mod does not violate the constant variance, linearity, or normality.

summary(final_mod)$adj.r.squared
## [1] 0.9865657
summary(final_mod)$adj.r.squared > summary(adjusted_polynomial_interactive_model_4)$adj.r.squared
## [1] TRUE
sum(vif(final_mod) > 5) < sum(vif(adjusted_polynomial_interactive_model_4) > 5)
## [1] TRUE
extractAIC(final_mod) < extractAIC(adjusted_polynomial_interactive_model_4)
## [1] TRUE TRUE
extractAIC(final_mod, k = log(length(resid(final_mod)))) < extractAIC(adjusted_polynomial_interactive_model_4, k = log(length(resid(adjusted_polynomial_interactive_model_4))))
## [1] TRUE TRUE

We can see that our final_mod outperforms adjusted_polynomial_interactive_model_4.

Discussion

From our results we obtain that VIX open, SPY volume, US new daily cases, US new daily deaths, US new daily tests, China new daily cases, China new daily deaths, US unemployment rate, US quarterly gdp, sitting President, Senate majority party, and the seasons are significant predictors for the response of the SPY Closing Price. We omitted VIX Close and the season of fall, since these variables either were dropped in the final step function we used, or caused multiple test statistics to fluctuate (too high of a p-value, R squared value). We also determined that China New Daily Tests were not useful in the creation of various models/plots for our data since there were many sporadic and unnatural patterns within it, so we did not include that either. A certain aspect of our analysis that created cause for concern however, was that it potentially overfits the data, which we determined by observing our predictive model plot that we created. This was potentially caused by the various transformations we performed on our initial model. Our final model is useful to where we are able to use it to predict our response (the SPY Close) for future days where the stock market is open, therefore achieving our goal in finding out which of our predictor variables significantly or insignificantly impacted the stock market in the scope of out dataset.

Appendix

We chose not to data dump the summaries of each individual regression models here because based off of our evaluations, all of the information pertinent to our project has already been included in the methods section.

Names

Jiesen Zhang (jiesenz2)

Akhil Bhamidipati (akhilsb2)

Raymond Moy (rymoy2)

Ayan Bhowmick (ayanb2)